Counting Quail, a second time

For all problem sets, we will provide you with a single R Markdown file. Please complete the problem set within a local copy of this file. To turn in, please upload a fully knitted html version. Make sure to keep echo=TRUE, as appropriate, to show your coding.

In Problem Set 6 you were introduced to a stunning dataset analyzing count data for the Angeleno Quail (Callipepla curtii) collected in 2015. As a reminder, three times during the breeding season, surveyors visited 267 survey points doing targeted surveys for the quail.

Last time, you fit a mixed-effects model to the data and found strong effects of elevation, forest cover, an interaction, and wind. You noticed, however, that your dataset included a lot of zeroes, but you didn’t let that disturb you! After this week’s lessons, however, you’ve become a bit more worried, and you wish to try fitting the data in a way that accounts for imperfect detection. Since you have count data, you decide to use a N-mixture model. Like like an occupancy model, but for abundance data, the N-mixture model separates the underlying abundance state process (here, the factors that determine quail abundance) from the observation process (here, how imperfect detection reduces the number of quail actually seen), and models the data as a mixture of these two processes.

After talking to the biologists who surveyed the species, they tell you that they believe that forest cover and elevation likely influence quail abundance, while wind is a very strong factor in their ability to detect quail (but as a variable that changes by the day, wind doesn’t influence their true underlying abundance). The biologists are unsure whether canopy cover (which correlates with ground cover, as well as auditory transmission) impacts detection probability, so it’s probably safer to test for it. You use this natural history knowledge to guide you in your model parameterization.

The data are identical as used for Problem Set 6, so re-load in the R data object.

load("ProblemSet6.Rdata")

Your goal is to run a single N-mixture model in JAGS and learn what you can about the species from the fitted model. The model should include all the major components of this week’s problem set: the effect of elevation, forest cover, the interaction, and wind. Note, however, that you do not have to include a random effect for site with this N-mixture model, because N-mixture – like occupancy – models, assumes that the true underlying state does not change across repeat visits. Thus the “pseudoreplication” of repeat visits is what informs the detection parameter, but abundance covariates are parameterized on the latent variable, \(N\), which is assumed to be unchanging across visits (an assumption called “closure”). You should realize, consequently, that the true abundance at a site (\(N_j\)) is only estimated once for each of \(j\) sites. Consequently, your model should follow the general structure of an N-mixture model, which is:

\(y_{ij}\sim binomial(p_{jk}, N_j)\)

\(N_j \sim Poisson(\lambda_j)\)

where \(p_{jk}\) is a probability of detection at site \(j\) and visit \(k\) that can be modeled as a function of covariates, and \(\lambda_j\) is the expected true abundance at site \(j\), which can also be modeled as a function of covariates.

  1. Draw a DAG for the model you are about the create. Display your DAG below using the ggdag R package.

  2. Using your DAG, write out the JAGS model code for this model. Use appropriate vague priors for all parameters. You will need to create two nested for-loops, a first loop over every site j, and a second loop over each visit k. Thus, your process model should be for every count[j,k]. Keep in mind that your deterministic process and observation models will use different link functions, which means that a truly ‘vague’ prior for your parameters may differ depending on whether they are an abundance parameter or a detection parameter.

qs_mod<-function(){
  b0 ~ dnorm(0,0.00001)
  b_elev ~ dnorm(0,0.00001)
  b_forest ~ dnorm(0,0.00001)
  b_int ~ dnorm(0,0.00001)
  alpha_0 ~ dnorm(0,0.00001)
  alpha_wind ~ dnorm(0,0.00001)
  alpha_forest ~ dnorm(0,0.00001)
  tau_site ~ dgamma(0.001,0.001)
  sigma_site <- 1 / sqrt(tau_site)
  for (j in 1:site_num){
    N[j] ~ dpois(lam[j])
    log(lam[j])<-b0+b_elev*elev[j]+b_forest*forest[j]+b_int*elev[j]*forest[j]+b_site[j]
    b_site[j] ~ dnorm(0,tau_site)
    for (k in 1:visit_num){
      logit(p[j,k])=alpha_0+alpha_wind*wind[j,k]+alpha_forest*forest[j]
      count[j,k] ~ dbinom(p[j,k],N[j])
      count.new[j,k] ~ dbinom(p[j,k],N[j])
    }
  }
}
  1. Given that we ran an analysis last week and estimated parameters for this species. Would it be advantageous to use the findings from last week’s analysis as informed priors for the current analysis? Why or why not?
# I Do not think it would be advantagous to use the finding sfrom last week's analysis as informed priors for the current analysis. The issue is that in last week analysis, one of the underlying assumptions is that there is perfect observation of the indiviudals within the community.However, in this week we are adding imperfect observaiton into the model. Thus if we use last week's estimated parameters as priors, the priors may not necessarily cover all the pisterior distirbution of the parameter this weeks.More speciically, it is very likely that the posterior distirbution for the covariates related to counts such as b_0 b_forest we gained form this week would be consistently larger than what have been got last week due to the effect of imperefect observaiton being added in. 

#Furthermore, we are also removign the effect of wind out from the covaraits of counting, thus this would chagne the strucure of the model, which would cause the priors to be inaccurate to be used in this week's analysis.
  1. Prepare your list of data to send to JAGS and run your JAGS model. Check to make sure it has converged. If the convergence is skeptical, change your MCMC parameters so that the model convergences. As discussed in class and shown in Lab 7, if you receive a Node inconsistent with parents error from JAGS, this indicates that you need to give JAGS plausible initial values for the latent variable, N.
set.seed(2025)
count=ps.data[["count"]]
elevation=ps.data[["elev"]]
forest=ps.data[["forest"]]
wind=ps.data[["wind"]]
avg_count=apply(count,1,mean)
site_num=nrow(wind)
visit_num=ncol(wind) 
site=rep(1:site_num,each=visit_num)
visit=rep(1:visit_num,times=site_num)
total_sample=site_num*visit_num

jags_data_qs<-list(
  site_num=site_num,
  visit_num=visit_num,
  count=count,
  wind=wind,
  elev=elevation,
  forest=forest
)

qs.fit<-jags(data=jags_data_qs,parameters.to.save=c("b0", "b_elev","b_forest","b_int","sigma_site","alpha_0","alpha_wind","alpha_forest"),model.file=qs_mod,
             inits=list(list(N=apply(count,1,max)+1),
                        list(N=apply(count,1,max)+1),
                        list(N=apply(count,1,max)+1)),
              n.chains=3,
              n.iter=20000,
              n.burnin=5000,
              n.thin=10)
## module glm loaded
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 801
##    Unobserved stochastic nodes: 1343
##    Total graph size: 7492
## 
## Initializing model
qs.check<-jags(data=jags_data_qs,parameters.to.save=c("count.new","p"),model.file=qs_mod,
              inits=list(list(N=apply(count,1,max)+1),
                        list(N=apply(count,1,max)+1),
                        list(N=apply(count,1,max)+1)),
                    n.chains=3,
                    n.iter=20000,
                    n.burnin=5000,
                    n.thin=10)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 801
##    Unobserved stochastic nodes: 1343
##    Total graph size: 7492
## 
## Initializing model
MCMCsummary(qs.fit) 
##                       mean          sd          2.5%           50%        97.5%
## alpha_0        -0.89359679  0.11327924   -1.12064343 -8.930820e-01   -0.6728587
## alpha_forest    0.01305130  0.16897255   -0.29803828  9.500327e-03    0.3376666
## alpha_wind     -2.98527991  0.13919744   -3.24938669 -2.989302e+00   -2.7015073
## b0              0.62427944  0.08908459    0.44919141  6.251070e-01    0.7980839
## b_elev         -2.05852386  0.11907291   -2.29740959 -2.058708e+00   -1.8250669
## b_forest        2.14059057  0.12068901    1.90594868  2.143818e+00    2.3740167
## b_int           1.29513484  0.16953891    0.96008017  1.295504e+00    1.6244017
## deviance     1323.42152541 28.44176862 1269.67480612  1.322952e+03 1381.8100583
## sigma_site      0.07152209  0.03542327    0.02278854  6.483815e-02    0.1561256
##              Rhat n.eff
## alpha_0      1.01   180
## alpha_forest 1.00   226
## alpha_wind   1.01   238
## b0           1.01   587
## b_elev       1.00   960
## b_forest     1.00   665
## b_int        1.00  1043
## deviance     1.02   216
## sigma_site   1.00  1422
# most of the Rhat are either 1 or close to 1 thus the model has converged
#However, there are count.new values that have Rhat NaN, which could be problematic?
  1. Make traceplots for your slope and intercept parameters. Check Prior-Posterior Overlap. Note that you may wish to do that in two different commands, if you used multiple priors in your model.
traceplot(qs.fit)

pr_norm=rnorm(15000,mean=0,sd=sqrt(1/0.00001))
pr_gamma=1/sqrt(rgamma(15000,0.01,0.01))
MCMCtrace(qs.fit,params="b0",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b0", priors = pr_norm, pdf = FALSE):
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

MCMCtrace(qs.fit,params="b_elev",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b_elev", priors = pr_norm, pdf = FALSE):
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

MCMCtrace(qs.fit,params="b_int",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b_int", priors = pr_norm, pdf = FALSE):
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

MCMCtrace(qs.fit,params="b_forest",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b_forest", priors = pr_norm, pdf =
## FALSE): Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

MCMCtrace(qs.fit,params="alpha_0",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "alpha_0", priors = pr_norm, pdf =
## FALSE): Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

MCMCtrace(qs.fit,params="alpha_wind",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "alpha_wind", priors = pr_norm, pdf =
## FALSE): Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

MCMCtrace(qs.fit,params="alpha_forest",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "alpha_forest", priors = pr_norm, :
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

  1. Your traceplots should look quite different for the parameters for detection versus the parameters for abundance (e.g., compare traceplots for the effect of forest on abundance vs detection). Why should MCMC be more or less efficient for abundance versus detection processes?
#MCMC for abundance is much more efficient than the detection processes. I think one of the main reasons is that the detection parameters are highly correlated to N, hence the value of the abundance parameters in each MCMC step. Thus, as a result, the parameters in the detection processes would demonstrate a higher interrelation between steps when put into a MCMC process: besides the autocorrelation that is introduced by the MCMC process of itself, they are also gaining autocorrelation from the MCMC process of the abundance parameters. Thus, the MCMC tends to be less efficient for the detection processes. 
  1. Using the posterior summary, what can we learn about our hypothesized variables of interest?
MCMCsummary(qs.fit)
##                       mean          sd          2.5%           50%        97.5%
## alpha_0        -0.89359679  0.11327924   -1.12064343 -8.930820e-01   -0.6728587
## alpha_forest    0.01305130  0.16897255   -0.29803828  9.500327e-03    0.3376666
## alpha_wind     -2.98527991  0.13919744   -3.24938669 -2.989302e+00   -2.7015073
## b0              0.62427944  0.08908459    0.44919141  6.251070e-01    0.7980839
## b_elev         -2.05852386  0.11907291   -2.29740959 -2.058708e+00   -1.8250669
## b_forest        2.14059057  0.12068901    1.90594868  2.143818e+00    2.3740167
## b_int           1.29513484  0.16953891    0.96008017  1.295504e+00    1.6244017
## deviance     1323.42152541 28.44176862 1269.67480612  1.322952e+03 1381.8100583
## sigma_site      0.07152209  0.03542327    0.02278854  6.483815e-02    0.1561256
##              Rhat n.eff
## alpha_0      1.01   180
## alpha_forest 1.00   226
## alpha_wind   1.01   238
## b0           1.01   587
## b_elev       1.00   960
## b_forest     1.00   665
## b_int        1.00  1043
## deviance     1.02   216
## sigma_site   1.00  1422
#The mean of b0 is 0.624, meaning that the average count for all sites, without the consideration of all the other effect is e^0.624

#Elevation has a highly-certain and strong negative impact on quail counts. The mean of b_elev is -2.059. Thus, when elevation increase by 1, the mean count of the site will change by a factor of e^-2.059

#Both canopy cover and the ineraction between canopy cover and elevation have a positive impact on the count fo quails, and we are very much confident about this effect. On average, when canopy cover and the interaction between canopy cover and elevation increase by 1 unit, the expected abundance of quails at that site is increase by a factor of e^2.14 and e^1.29, respectively

#Sigma_site is positive but relative low: 0.0715, suggesting that most of the variation has been explained by the existing parameters and site  do not contain any hidden variables that offer a strong explanatory power to the count of quails.

#alpha_0's effect on observation probability is strongly negative with high certainty. The mean of alpha_0 is -0.894, thus the average odd (not probability) of successful observation without considering other effect is e^-0.894.

#alpha_wind has a very strong negative and highly certain effect on observation success. The mean of alpha_wind is -2.99, suggesting that the average odd of successful observation would change by a factor of e^-2.98 when wind increase by 1 unit. Thus we are highly confident that wind negatively impacts our observation success for quails

#alpha_forest has a 95% confidence interval overlap with 0, thus we are uncertain whether it has any actual impact on the observation success or not. Moreover the mean value is small: 0.0131. Thus it is safe to say that canopy cover does not actually contribute much to the detection process of quails
  1. Let’s double check that this model is an adequate fit to the data. Last week we conducted two posterior predictive checks (PPC), using a test statistic of mean(y) and sd(y). Remember that the model fit well for the mean but not the sd. Repeat this now, with our N-mixture model, by calculating the observed and posterior distributions for both test statistics, and plot each (make sure your plots show both observed lines and the full posterior). Calculate the Bayesian p-value for each as well.
ppd=MCMCchains(qs.check,params="count.new")
ppd_mean=apply(ppd, 1, mean)
ppd_sd=apply(ppd, 1, sd)
obs_mean=mean(count)
obs_sd=sd(count)
df=data.frame(mean=ppd_mean,sd=ppd_sd)

hist(ppd_mean,breaks=20,main="Posterior Prediction Check for Mean",xlab="Simulated Means",ylab="Frequency",xlim=c(min(ppd_mean,obs_mean)-1,max(ppd_mean,obs_mean)+1))
abline(v=obs_mean,col="red",lwd=2,lty=2)
legend("topright",legend ="Observed mean",col="red",lty=2,lwd=2)

hist(ppd_sd,breaks=20,main="Posterior Prediction Check for SD",xlab="Simulated Sd",ylab="Frequency",xlim=c(min(ppd_sd,obs_sd)-1,max(ppd_sd,obs_sd)+1))
abline(v=obs_sd,col="red",lwd=2,lty=2)
legend("topright",legend ="Observed Standard Deviation",col="red",lty=2,lwd=2)

p_mean=mean(ppd_mean>obs_mean)
p_sd=mean(ppd_sd>obs_sd)

p_mean
## [1] 0.4791111
p_sd
## [1] 0.6746667
#The p values are 0 and 1 
#this is reasonable beacuse we are including observation probabilty into the model thus the estiamted average count should be larger than the observed value so that an adequate number could be observed

#For the standard deviation, the posterior predictive values are consistently smaller than the observed ones. This can be attributed to the fact that different visits introduce additional noise into the observed dataset. However, this extra variability is accounted for and partially corrected by the inclusion of a random effect in the model in both the abundance and observation process, which reduces the variation in the predicted dataset.
  1. Using your posterior, create three plots (a, b, c), each showing how true quail abundance is predicted to change with forest cover. Plot (a) should show “low elevation” (elev = -1), (b) should show middle elevation (elev = 0), and (c) should show high elevation (elev = 1). Your y-axis should be “Predicted abundance” and your x-axis should be “Elevation (scaled)”. Label each plot. Each plot should clearly show a median prediction line, as well as a 89% credible interval in prediction around that median (to accomplish this, you will have to derive predictions from the full posterior sample of your parameters).
ppd=MCMCchains(qs.check,params="count.new")
b0=MCMCchains(qs.fit,params="b0")
b_elev=MCMCchains(qs.fit,params="b_elev")
b_forest=MCMCchains(qs.fit,params="b_forest")
b_int=MCMCchains(qs.fit,params="b_int")

forest_cov_pred=seq(min(forest),max(forest),by=0.01)
elev_discrete=c(-1,0,1)

plot_list <- list()
for (i in elev_discrete){
  median_ppd=c()
  lower_ppd=c()
  upper_ppd=c()
  elev_val_used=c()
  prediction=c()
  for (j in 1:length(forest_cov_pred)){
  forest_cov_val=forest_cov_pred[j]
  count=exp(b0+b_elev*i+b_forest*forest_cov_val+b_int*i*forest_cov_val)
  lower_ppd[j]=quantile(count,0.055)
  upper_ppd[j]=quantile(count,0.945)
  elev_val_used[j]=i
  prediction[j]=median(count)
  }
  plot_list[[as.character(i)]] <- data.frame(
    forest_cover=forest_cov_pred,
    median_pred=prediction,
    lower=lower_ppd,
    upper=upper_ppd,
    elevation=elev_val_used)
}
ggplot(plot_list[["-1"]],aes(x=forest_cover))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Mean"), size = 1)+
  labs(title = "Predicted Quail Abundance with 89% Confidence Interval (Elevation=-1)",x="Forest Cover",y="Predicted Abundance",col="Key")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(plot_list[["0"]],aes(x=forest_cover))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Mean"), size = 1)+
  labs(title = "Predicted Quail Abundance with 89% Confidence Interval (Elevation=0)",x="Forest Cover",y="Predicted Abundance",col="Key")

ggplot(plot_list[["1"]],aes(x=forest_cover))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Mean"), size = 1)+
  labs(title = "Predicted Quail Abundance with 89% Confidence Interval (Elevation=1)",x ="Forest Cover",y = "Predicted Abundance",col="Key")

  1. Compare your plots for #9 above to your similar plots from Problem Set 6, where you modeled this data using a GLMM. How are the plots similar? How are they different? If there are major differences, explain why the N-mixture model is producing these differences?
#The plots looks similar in terms of their shape. This is reasonable as most of the biological process in terms of how elevation and forest cover are descibing teh abundance are similar between the two models thus they should be similar.

#However there is one thign to notice is that the plots above predict much more count per site when comapred to what have been produced in PS6, this is because that we have now taken into account the process detection, thus the number of quails per sites need to be larger than the amount of quails that are actually observed/the amounts of quails when observation is perfect. Thus the counts in the plots above are much higher than that in PS 6
  1. Unlike our GLMM model, our N-mixture model is modeling two key parameters: true abundance (\(N\)) and probability of detection (\(p\)). Similar to question #9, create a predicted response plot showing the median response and 89% credible ribbon, but this time, plot the probability of detection as a function of wind.
a0=MCMCchains(qs.fit,params="alpha_0")
a_forest=MCMCchains(qs.fit,params="alpha_forest")
a_wind=MCMCchains(qs.fit,params="alpha_wind")
p_val_ppd=MCMCchains(qs.check,params="p")

wind_cov_pred=seq(min(wind),max(wind),by=0.01)
forest_discrete=c(-1,0,1)

plot_list <- list()
for (i in forest_discrete){
  median_ppd=c()
  lower_ppd=c()
  upper_ppd=c()
  forest_val_used=c()
  prediction=c()
  for (j in 1:length(wind_cov_pred)){
  wind_val=wind_cov_pred[j]
  p_val=exp(a0+a_forest*i+a_wind*wind_val)/(1+exp(a0+a_forest*i+a_wind*wind_val))
  lower_ppd[j]=quantile(p_val,0.055)
  upper_ppd[j]=quantile(p_val,0.945)
  forest_val_used[j]=i
  prediction[j]=median(p_val)
  }
  plot_list[[as.character(i)]] <- data.frame(
    wind_val_plot=wind_cov_pred,
    median_pred=prediction,
    lower=lower_ppd,
    upper=upper_ppd,
    forest_val_plot=forest_val_used)
}
ggplot(plot_list[["-1"]],aes(x=wind_val_plot))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Mean"), size = 1)+
  labs(title = "Predicted Quail Abundance with 89% Confidence Interval (Forest Cover=-1)",x="Wind",y="Probability of observation",col="Key")

ggplot(plot_list[["0"]],aes(x=wind_val_plot))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Mean"), size = 1)+
  labs(title = "Predicted Quail Abundance with 89% Confidence Interval (Forest Cover=0)",x="Wind",y="Probability of observation",col="Key")

ggplot(plot_list[["1"]],aes(x=wind_val_plot))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Mean"), size = 1)+
  labs(title = "Predicted Quail Abundance with 89% Confidence Interval (Forest Cover=1)",x="Wind",y="Probability of observation",col="Key")